Diego Alzate

Credit Card Users Churn Prediction¶

Problem Statement¶

Business Context¶

Thera Bank recently saw a steep decline in the number of users of its credit card. Credit cards are a good source of income for banks because of the different kinds of fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others are charged only under specified circumstances.

Customers leaving the credit card service would lead to a loss for the bank, so the bank wants to analyze its customer data to identify the customers who are likely to leave the credit card service, and the reasons why, so that the bank can improve in those areas.

As a Data Scientist at Thera Bank, you need to come up with a classification model that will help the bank improve its services so that customers do not renounce their credit cards.

Data Description¶

  • CLIENTNUM: Client number. Unique identifier for the customer holding the account
  • Attrition_Flag: Internal event (customer activity) variable - if the account is closed then "Attrited Customer" else "Existing Customer"
  • Customer_Age: Age in Years
  • Gender: Gender of the account holder
  • Dependent_count: Number of dependents
  • Education_Level: Educational qualification of the account holder - Graduate, High School, Unknown, Uneducated, College (refers to a college student), Post-Graduate, Doctorate
  • Marital_Status: Marital Status of the account holder
  • Income_Category: Annual Income Category of the account holder
  • Card_Category: Type of Card
  • Months_on_book: Period of relationship with the bank (in months)
  • Total_Relationship_Count: Total no. of products held by the customer
  • Months_Inactive_12_mon: No. of months inactive in the last 12 months
  • Contacts_Count_12_mon: No. of Contacts in the last 12 months
  • Credit_Limit: Credit Limit on the Credit Card
  • Total_Revolving_Bal: Total Revolving Balance on the Credit Card
  • Avg_Open_To_Buy: Open to Buy Credit Line (Average of last 12 months)
  • Total_Amt_Chng_Q4_Q1: Change in Transaction Amount (Q4 over Q1)
  • Total_Trans_Amt: Total Transaction Amount (Last 12 months)
  • Total_Trans_Ct: Total Transaction Count (Last 12 months)
  • Total_Ct_Chng_Q4_Q1: Change in Transaction Count (Q4 over Q1)
  • Avg_Utilization_Ratio: Average Card Utilization Ratio

What Is a Revolving Balance?¶

  • If we don't pay the balance of a revolving credit account in full every month, the unpaid portion carries over to the next month. That's called a revolving balance.

What Is the Average Open to Buy?¶

  • 'Open to Buy' means the amount left on your credit card to use. This column represents the average of this value over the last 12 months.

What Is the Average Utilization Ratio?¶

  • The Avg_Utilization_Ratio represents how much of the available credit the customer has spent. This is useful for calculating credit scores.

Relation between Avg_Open_To_Buy, Credit_Limit, and Avg_Utilization_Ratio:¶

  • ( Avg_Open_To_Buy / Credit_Limit ) + Avg_Utilization_Ratio = 1
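A quick numeric check of this identity, using the values from row 0 of the dataset preview shown later in the notebook (Credit_Limit = 12691.0, Avg_Open_To_Buy = 11914.0, Avg_Utilization_Ratio = 0.061). Since the ratio is reported to only three decimals, the identity holds up to rounding:

```python
# Values from row 0 of the dataset (see df.head() later in the notebook)
credit_limit = 12691.0
avg_open_to_buy = 11914.0
avg_utilization_ratio = 0.061

# (Avg_Open_To_Buy / Credit_Limit) + Avg_Utilization_Ratio should be ~1
lhs = avg_open_to_buy / credit_limit + avg_utilization_ratio
print(round(lhs, 2))  # 1.0 -- the identity holds up to rounding
```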

Please read the instructions carefully before starting the project.¶

This is a commented Jupyter IPython Notebook file in which all the instructions and tasks to be performed are mentioned.

  • Blanks '_______' are provided in the notebook that need to be filled with appropriate code to get the correct result. With every '_______' blank, there is a comment that briefly describes what needs to be filled in that space.

  • Identify the task to be performed correctly, and only then proceed to write the required code.
  • Fill in the code wherever asked by commented lines like "# write your code here" or "# complete the code". Running incomplete code may throw an error.
  • Please run the code cells sequentially from the beginning to avoid unnecessary errors.
  • Add the results/observations (wherever mentioned) derived from the analysis to the presentation and submit it.

Importing necessary libraries¶

In [ ]:
# Importing required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Importing classifiers
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    BaggingClassifier,
)
# Importing imputers and model evaluation metrics
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
)
# Importing resampling techniques
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
# Importing metrics
from sklearn import metrics

# To suppress warnings
import warnings

warnings.filterwarnings("ignore")

Loading the dataset¶

In [ ]:
original_data = pd.read_csv("BankChurners.csv")

Data Overview¶

  • Observations
  • Sanity checks
In [ ]:
df = original_data.copy()
print(f"Number of rows: {df.shape[0]}, Number of columns: {df.shape[1]}")
Number of rows: 10127, Number of columns: 21
In [ ]:
df.head()
Out[ ]:
CLIENTNUM Attrition_Flag Customer_Age Gender Dependent_count Education_Level Marital_Status Income_Category Card_Category Months_on_book ... Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
0 768805383 Existing Customer 45 M 3 High School Married $60K - $80K Blue 39 ... 1 3 12691.0 777 11914.0 1.335 1144 42 1.625 0.061
1 818770008 Existing Customer 49 F 5 Graduate Single Less than $40K Blue 44 ... 1 2 8256.0 864 7392.0 1.541 1291 33 3.714 0.105
2 713982108 Existing Customer 51 M 3 Graduate Married $80K - $120K Blue 36 ... 1 0 3418.0 0 3418.0 2.594 1887 20 2.333 0.000
3 769911858 Existing Customer 40 F 4 High School NaN Less than $40K Blue 34 ... 4 1 3313.0 2517 796.0 1.405 1171 20 2.333 0.760
4 709106358 Existing Customer 40 M 3 Uneducated Married $60K - $80K Blue 21 ... 1 0 4716.0 0 4716.0 2.175 816 28 2.500 0.000

5 rows × 21 columns

In [ ]:
df.tail()
Out[ ]:
CLIENTNUM Attrition_Flag Customer_Age Gender Dependent_count Education_Level Marital_Status Income_Category Card_Category Months_on_book ... Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
10122 772366833 Existing Customer 50 M 2 Graduate Single $40K - $60K Blue 40 ... 2 3 4003.0 1851 2152.0 0.703 15476 117 0.857 0.462
10123 710638233 Attrited Customer 41 M 2 NaN Divorced $40K - $60K Blue 25 ... 2 3 4277.0 2186 2091.0 0.804 8764 69 0.683 0.511
10124 716506083 Attrited Customer 44 F 1 High School Married Less than $40K Blue 36 ... 3 4 5409.0 0 5409.0 0.819 10291 60 0.818 0.000
10125 717406983 Attrited Customer 30 M 2 Graduate NaN $40K - $60K Blue 36 ... 3 3 5281.0 0 5281.0 0.535 8395 62 0.722 0.000
10126 714337233 Attrited Customer 43 F 2 Graduate Married Less than $40K Silver 25 ... 2 4 10388.0 1961 8427.0 0.703 10294 61 0.649 0.189

5 rows × 21 columns

CLIENTNUM contains a unique ID for each client and is not relevant for modeling, so it can be dropped.

In [ ]:
df.drop(["CLIENTNUM"], axis=1, inplace=True)
In [ ]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 20 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Attrition_Flag            10127 non-null  object 
 1   Customer_Age              10127 non-null  int64  
 2   Gender                    10127 non-null  object 
 3   Dependent_count           10127 non-null  int64  
 4   Education_Level           8608 non-null   object 
 5   Marital_Status            9378 non-null   object 
 6   Income_Category           10127 non-null  object 
 7   Card_Category             10127 non-null  object 
 8   Months_on_book            10127 non-null  int64  
 9   Total_Relationship_Count  10127 non-null  int64  
 10  Months_Inactive_12_mon    10127 non-null  int64  
 11  Contacts_Count_12_mon     10127 non-null  int64  
 12  Credit_Limit              10127 non-null  float64
 13  Total_Revolving_Bal       10127 non-null  int64  
 14  Avg_Open_To_Buy           10127 non-null  float64
 15  Total_Amt_Chng_Q4_Q1      10127 non-null  float64
 16  Total_Trans_Amt           10127 non-null  int64  
 17  Total_Trans_Ct            10127 non-null  int64  
 18  Total_Ct_Chng_Q4_Q1       10127 non-null  float64
 19  Avg_Utilization_Ratio     10127 non-null  float64
dtypes: float64(5), int64(9), object(6)
memory usage: 1.5+ MB

Observations¶

  • The DataFrame contains 10,127 rows and 20 columns.
  • The columns represent different attributes of the data, such as customer information, account details, and transaction data.
  • The Attrition_Flag column is of type object, indicating it likely contains categorical data representing customer churn information.
  • The Customer_Age column is of type int64, suggesting it represents the age of the customers.
  • The Gender, Education_Level, Marital_Status, Income_Category, and Card_Category columns are also of type object, implying they contain categorical data representing customer demographics and financial information.
  • The Months_on_book, Total_Relationship_Count, Months_Inactive_12_mon, Contacts_Count_12_mon, Total_Revolving_Bal, Total_Trans_Amt, and Total_Trans_Ct columns are of type int64, indicating they represent numerical values such as counts or durations.
  • The Credit_Limit, Avg_Open_To_Buy, Total_Amt_Chng_Q4_Q1, Total_Ct_Chng_Q4_Q1, and Avg_Utilization_Ratio columns are of type float64, suggesting they represent continuous numerical values.
  • Some columns have missing values (non-null counts are less than 10,127). Specifically, the Education_Level and Marital_Status columns have missing values.
  • The memory usage of the DataFrame is approximately 1.5 MB.

Count the duplicated data

In [ ]:
df[df.duplicated()].count()
Out[ ]:
Attrition_Flag              0
Customer_Age                0
Gender                      0
Dependent_count             0
Education_Level             0
Marital_Status              0
Income_Category             0
Card_Category               0
Months_on_book              0
Total_Relationship_Count    0
Months_Inactive_12_mon      0
Contacts_Count_12_mon       0
Credit_Limit                0
Total_Revolving_Bal         0
Avg_Open_To_Buy             0
Total_Amt_Chng_Q4_Q1        0
Total_Trans_Amt             0
Total_Trans_Ct              0
Total_Ct_Chng_Q4_Q1         0
Avg_Utilization_Ratio       0
dtype: int64

Observations¶

  • There are no duplicated values in the dataset

Count null values

In [ ]:
df.isnull().sum()
Out[ ]:
Attrition_Flag                 0
Customer_Age                   0
Gender                         0
Dependent_count                0
Education_Level             1519
Marital_Status               749
Income_Category                0
Card_Category                  0
Months_on_book                 0
Total_Relationship_Count       0
Months_Inactive_12_mon         0
Contacts_Count_12_mon          0
Credit_Limit                   0
Total_Revolving_Bal            0
Avg_Open_To_Buy                0
Total_Amt_Chng_Q4_Q1           0
Total_Trans_Amt                0
Total_Trans_Ct                 0
Total_Ct_Chng_Q4_Q1            0
Avg_Utilization_Ratio          0
dtype: int64

Observations¶

Missing values:

  • Education_Level: 1,519
  • Marital_Status: 749
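The raw counts are easier to interpret as shares of the 10,127 rows. A minimal sketch using the counts reported above; on the full DataFrame, `df.isnull().mean() * 100` gives the same figures directly:

```python
import pandas as pd

# Missing-value counts reported above, as percentages of the 10,127 rows
n_rows = 10127
missing = pd.Series({"Education_Level": 1519, "Marital_Status": 749})
missing_pct = (missing / n_rows * 100).round(2)
print(missing_pct)  # Education_Level: 15.0%, Marital_Status: 7.4%
```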
In [ ]:
df.describe(include='all').T
Out[ ]:
count unique top freq mean std min 25% 50% 75% max
Attrition_Flag 10127 2 Existing Customer 8500 NaN NaN NaN NaN NaN NaN NaN
Customer_Age 10127.0 NaN NaN NaN 46.32596 8.016814 26.0 41.0 46.0 52.0 73.0
Gender 10127 2 F 5358 NaN NaN NaN NaN NaN NaN NaN
Dependent_count 10127.0 NaN NaN NaN 2.346203 1.298908 0.0 1.0 2.0 3.0 5.0
Education_Level 8608 6 Graduate 3128 NaN NaN NaN NaN NaN NaN NaN
Marital_Status 9378 3 Married 4687 NaN NaN NaN NaN NaN NaN NaN
Income_Category 10127 6 Less than $40K 3561 NaN NaN NaN NaN NaN NaN NaN
Card_Category 10127 4 Blue 9436 NaN NaN NaN NaN NaN NaN NaN
Months_on_book 10127.0 NaN NaN NaN 35.928409 7.986416 13.0 31.0 36.0 40.0 56.0
Total_Relationship_Count 10127.0 NaN NaN NaN 3.81258 1.554408 1.0 3.0 4.0 5.0 6.0
Months_Inactive_12_mon 10127.0 NaN NaN NaN 2.341167 1.010622 0.0 2.0 2.0 3.0 6.0
Contacts_Count_12_mon 10127.0 NaN NaN NaN 2.455317 1.106225 0.0 2.0 2.0 3.0 6.0
Credit_Limit 10127.0 NaN NaN NaN 8631.953698 9088.77665 1438.3 2555.0 4549.0 11067.5 34516.0
Total_Revolving_Bal 10127.0 NaN NaN NaN 1162.814061 814.987335 0.0 359.0 1276.0 1784.0 2517.0
Avg_Open_To_Buy 10127.0 NaN NaN NaN 7469.139637 9090.685324 3.0 1324.5 3474.0 9859.0 34516.0
Total_Amt_Chng_Q4_Q1 10127.0 NaN NaN NaN 0.759941 0.219207 0.0 0.631 0.736 0.859 3.397
Total_Trans_Amt 10127.0 NaN NaN NaN 4404.086304 3397.129254 510.0 2155.5 3899.0 4741.0 18484.0
Total_Trans_Ct 10127.0 NaN NaN NaN 64.858695 23.47257 10.0 45.0 67.0 81.0 139.0
Total_Ct_Chng_Q4_Q1 10127.0 NaN NaN NaN 0.712222 0.238086 0.0 0.582 0.702 0.818 3.714
Avg_Utilization_Ratio 10127.0 NaN NaN NaN 0.274894 0.275691 0.0 0.023 0.176 0.503 0.999

Observations¶

  • Attrition_Flag: There are 10,127 entries. The unique values are 2, with "Existing Customer" being the most frequent entry (appearing 8,500 times).
  • Customer_Age: The column contains numerical data. The mean age is approximately 46.33 years, with a standard deviation of 8.02. The minimum age is 26, and the maximum age is 73.
  • Gender: There are 10,127 entries. The unique values are 2, with "F" (female) being the most frequent entry (appearing 5,358 times).
  • Dependent_count: The column contains numerical data. The mean count of dependents is approximately 2.35, with a standard deviation of 1.30. The minimum count is 0, and the maximum count is 5.
  • Education_Level: There are 8,608 entries. The unique values are 6, with "Graduate" being the most frequent entry (appearing 3,128 times).
  • Marital_Status: There are 9,378 entries. The unique values are 3, with "Married" being the most frequent entry (appearing 4,687 times).
  • Income_Category: There are 10,127 entries. The unique values are 6, with "Less than $40K" being the most frequent entry (appearing 3,561 times).
  • Card_Category: There are 10,127 entries. The unique values are 4, with "Blue" being the most frequent entry (appearing 9,436 times).
  • Months_on_book: The column contains numerical data. The mean number of months is approximately 35.93, with a standard deviation of 7.99. The minimum number of months is 13, and the maximum number of months is 56.
  • Total_Relationship_Count: The column contains numerical data. The mean count of total relationships is approximately 3.81, with a standard deviation of 1.55. The minimum count is 1, and the maximum count is 6.
  • Months_Inactive_12_mon: The column contains numerical data. The mean number of inactive months is approximately 2.34, with a standard deviation of 1.01. The minimum number of inactive months is 0, and the maximum number of inactive months is 6.
  • Contacts_Count_12_mon: The column contains numerical data. The mean number of contacts in the last 12 months is approximately 2.46, with a standard deviation of 1.11. The minimum count is 0, and the maximum count is 6.
  • Credit_Limit: The column contains numerical data. The mean credit limit is approximately 8,631.95, with a standard deviation of 9,088.78. The minimum credit limit is 1,438.30, and the maximum credit limit is 34,516.00.
  • Total_Revolving_Bal: The column contains numerical data. The mean total revolving balance is approximately 1,162.81, with a standard deviation of 814.99. The minimum balance is 0, and the maximum balance is 2,517.
  • Avg_Open_To_Buy: The column contains numerical data. The mean average open-to-buy amount is approximately 7,469.14, with a standard deviation of 9,090.69. The minimum amount is 3.00, and the maximum amount is 34,516.00.
  • Total_Amt_Chng_Q4_Q1: The column contains numerical data. The mean change in transaction amount from Q4 to Q1 is approximately 0.76, with a standard deviation of 0.22. The minimum change is 0.0, and the maximum change is 3.40.
  • Total_Trans_Amt: The column contains numerical data. The mean total transaction amount is approximately 4,404.09, with a standard deviation of 3,397.13. The minimum amount is 510, and the maximum amount is 18,484.
  • Total_Trans_Ct: The column contains numerical data. The mean total transaction count is approximately 64.86, with a standard deviation of 23.47. The minimum count is 10, and the maximum count is 139.
  • Total_Ct_Chng_Q4_Q1: The column contains numerical data. The mean change in transaction count from Q4 to Q1 is approximately 0.71, with a standard deviation of 0.24. The minimum change is 0.0, and the maximum change is 3.71.
  • Avg_Utilization_Ratio: The column contains numerical data. The mean average utilization ratio is approximately 0.27, with a standard deviation of 0.28. The minimum ratio is 0.0, and the maximum ratio is 0.999.
In [ ]:
# Attrition_Flag
df["Attrition_Flag"].replace("Existing Customer", 0, inplace=True)
df["Attrition_Flag"].replace("Attrited Customer", 1, inplace=True)

# Convert the columns with an 'object' datatype into categorical variables
for feature in df.columns:
    if df[feature].dtype == 'object':
        df[feature] = pd.Categorical(df[feature])
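After mapping the target to 0/1, it is worth checking the class balance, since churn data is typically imbalanced (which is why SMOTE and RandomUnderSampler are imported above). A minimal sketch using the class counts reported elsewhere in this notebook (8,500 existing vs. 1,627 attrited); on the real data, `df["Attrition_Flag"].value_counts(normalize=True)` gives the same result directly:

```python
import pandas as pd

# Class counts from the dataset: 8,500 existing (0), 1,627 attrited (1)
flag = pd.Series([0] * 8500 + [1] * 1627)
balance = flag.value_counts(normalize=True).round(3)
print(balance)  # roughly 0.839 for class 0 and 0.161 for class 1
```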
In [ ]:
# Iterate over columns with 'category' data type in the DataFrame
for i in df.describe(include='category').columns:
    print("Unique values in", i, "are :")
    print(df[i].value_counts())
    print("*" * 50)
Unique values in Gender are :
Gender
F    5358
M    4769
Name: count, dtype: int64
**************************************************
Unique values in Education_Level are :
Education_Level
Graduate         3128
High School      2013
Uneducated       1487
College          1013
Post-Graduate     516
Doctorate         451
Name: count, dtype: int64
**************************************************
Unique values in Marital_Status are :
Marital_Status
Married     4687
Single      3943
Divorced     748
Name: count, dtype: int64
**************************************************
Unique values in Income_Category are :
Income_Category
Less than $40K    3561
$40K - $60K       1790
$80K - $120K      1535
$60K - $80K       1402
abc               1112
$120K +            727
Name: count, dtype: int64
**************************************************
Unique values in Card_Category are :
Card_Category
Blue        9436
Silver       555
Gold         116
Platinum      20
Name: count, dtype: int64
**************************************************

Observations¶

Gender:

  • There are two unique values in the "Gender" column: "F" (female) and "M" (male).
  • The count of females is 5,358.
  • The count of males is 4,769.

Education_Level:

  • There are six unique values in the "Education_Level" column: Graduate, High School, Uneducated, College, Post-Graduate, and Doctorate.
  • The count of customers with a Graduate education level is 3,128, making it the most common education level.
  • High School, Uneducated, College, Post-Graduate, and Doctorate follow in descending order of occurrence.

Marital_Status:

  • There are three unique values in the "Marital_Status" column: Married, Single, and Divorced.
  • The count of married customers is 4,687, making it the most common marital status.
  • Single and Divorced follow in descending order of occurrence.

Income_Category:

  • There are six unique values in the "Income_Category" column: Less than $40K, $40K - $60K, $80K - $120K, $60K - $80K, abc, and $120K+.
  • The count of customers with an income less than $40K is 3,561, making it the most common income category.
  • The "abc" value seems to be a non-standard entry or a placeholder, with a count of 1,112.

Card_Category:

  • There are four unique values in the "Card_Category" column: Blue, Silver, Gold, and Platinum.
  • The majority of customers, 9,436, have a "Blue" card category.
  • The "Silver," "Gold," and "Platinum" categories have significantly fewer occurrences.

Replacing the anomalous values with NaN¶

In [ ]:
df["Income_Category"].replace("abc", np.nan, inplace=True)
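The notebook imports SimpleImputer, so the missing categorical values (including the NaNs just created) can be filled with the most frequent category. A minimal sketch on a toy frame: the column names match the dataset, but the values here are illustrative only, and whether the project ultimately imputes this way is a modeling choice:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with NaNs, mimicking the two affected columns (values illustrative)
toy = pd.DataFrame({
    "Education_Level": ["Graduate", np.nan, "Graduate", "High School"],
    "Income_Category": ["Less than $40K", "Less than $40K", np.nan, "$120K +"],
})
imputer = SimpleImputer(strategy="most_frequent")  # mode imputation
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(filled)  # NaNs replaced by "Graduate" and "Less than $40K"
```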

Exploratory Data Analysis (EDA)¶

  • EDA is an important part of any project involving data.
  • It is important to investigate and understand the data better before building a model with it.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.

Questions:

  1. How is the total transaction amount distributed?
  2. What is the distribution of the level of education of customers?
  3. What is the distribution of the level of income of customers?
  4. How does the change in transaction count between Q4 and Q1 (Total_Ct_Chng_Q4_Q1) vary by the customer's account status (Attrition_Flag)?
  5. How does the number of months a customer was inactive in the last 12 months (Months_Inactive_12_mon) vary by the customer's account status (Attrition_Flag)?
  6. What are the attributes that have a strong correlation with each other?

The functions below need to be defined to carry out the Exploratory Data Analysis.¶

In [ ]:
# function to plot a boxplot and a histogram along the same scale.


def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a triangle will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram
In [ ]:
# function to create labeled barplots


def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # x-coordinate of the bar center
        y = p.get_height()  # height of the bar

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot
In [ ]:
# function to plot stacked bar chart

def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 1, 5))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1), frameon=False)
    plt.show()
In [ ]:
# Function to plot distributions

def distribution_plot_wrt_target(data, predictor, target):

    fig, axs = plt.subplots(2, 2, figsize=(12, 10))

    target_uniq = data[target].unique()

    axs[0, 0].set_title(
        "Distribution of target for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
    )

    axs[0, 1].set_title(
        "Distribution of target for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
    )

    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor,
                ax=axs[1, 0], palette="gist_rainbow")

    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )

    plt.tight_layout()
    plt.show()

Univariate analysis¶

In [ ]:
# histogram + boxplot for all the numerical columns
for column in df.columns:
    if df[column].dtypes != 'category':
        histogram_boxplot(df, column)

Observations¶

Histogram and boxplot figures were generated above for each numerical column (Attrition_Flag, Customer_Age, Dependent_count, Months_on_book, Total_Relationship_Count, Months_Inactive_12_mon, Contacts_Count_12_mon, Credit_Limit, Total_Revolving_Bal, Avg_Open_To_Buy, Total_Amt_Chng_Q4_Q1, Total_Trans_Amt, Total_Trans_Ct, Total_Ct_Chng_Q4_Q1, Avg_Utilization_Ratio); the plots themselves are not reproduced here.

In [ ]:
# labeled barplot for all the categorical columns
for column in df.columns:
    if df[column].dtypes == 'category':
        labeled_barplot(df, column)

Bivariate Analysis¶

In [ ]:
sns.pairplot(df, hue="Attrition_Flag")
Out[ ]:
<seaborn.axisgrid.PairGrid at 0x7f9dfbec2310>
In [ ]:
for column in ['Gender', 'Marital_Status', 'Education_Level', 'Income_Category', 'Contacts_Count_12_mon', 'Months_Inactive_12_mon', 'Total_Relationship_Count', 'Dependent_count']:
    print('-'*50)
    print(f'{column} vs Attrition_Flag')
    stacked_barplot(df, column, "Attrition_Flag")
--------------------------------------------------
Gender vs Attrition_Flag
Attrition_Flag     0     1    All
Gender                           
All             8500  1627  10127
F               4428   930   5358
M               4072   697   4769
------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------
Marital_Status vs Attrition_Flag
Attrition_Flag     0     1   All
Marital_Status                  
All             7880  1498  9378
Married         3978   709  4687
Single          3275   668  3943
Divorced         627   121   748
------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------
Education_Level vs Attrition_Flag
Attrition_Flag      0     1   All
Education_Level                  
All              7237  1371  8608
Graduate         2641   487  3128
High School      1707   306  2013
Uneducated       1250   237  1487
College           859   154  1013
Doctorate         356    95   451
Post-Graduate     424    92   516
------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------
Income_Category vs Attrition_Flag
Attrition_Flag      0     1   All
Income_Category                  
All              7575  1440  9015
Less than $40K   2949   612  3561
$40K - $60K      1519   271  1790
$80K - $120K     1293   242  1535
$60K - $80K      1213   189  1402
$120K +           601   126   727
------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------
Contacts_Count_12_mon vs Attrition_Flag
Attrition_Flag            0     1    All
Contacts_Count_12_mon                   
All                    8500  1627  10127
3                      2699   681   3380
2                      2824   403   3227
4                      1077   315   1392
1                      1391   108   1499
5                       117    59    176
6                         0    54     54
0                       392     7    399
------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------
Months_Inactive_12_mon vs Attrition_Flag
Attrition_Flag             0     1    All
Months_Inactive_12_mon                   
All                     8500  1627  10127
3                       3020   826   3846
2                       2777   505   3282
4                        305   130    435
1                       2133   100   2233
5                        146    32    178
6                        105    19    124
0                         14    15     29
------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------
Total_Relationship_Count vs Attrition_Flag
Attrition_Flag               0     1    All
Total_Relationship_Count                   
All                       8500  1627  10127
3                         1905   400   2305
2                          897   346   1243
1                          677   233    910
5                         1664   227   1891
4                         1687   225   1912
6                         1670   196   1866
------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------
Dependent_count vs Attrition_Flag
Attrition_Flag      0     1    All
Dependent_count                   
All              8500  1627  10127
3                2250   482   2732
2                2238   417   2655
1                1569   269   1838
4                1314   260   1574
0                 769   135    904
5                 360    64    424
------------------------------------------------------------------------------------------------------------------------
In [ ]:
for column in ['Total_Revolving_Bal', 'Credit_Limit', 'Customer_Age', 'Total_Trans_Ct', 'Total_Trans_Amt', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio', 'Months_on_book']:
    print('-'*50)
    print(f'{column} vs Attrition_Flag')
    distribution_plot_wrt_target(df, column, "Attrition_Flag")
--------------------------------------------------
Total_Revolving_Bal vs Attrition_Flag
--------------------------------------------------
Credit_Limit vs Attrition_Flag
--------------------------------------------------
Customer_Age vs Attrition_Flag
--------------------------------------------------
Total_Trans_Ct vs Attrition_Flag
--------------------------------------------------
Total_Trans_Amt vs Attrition_Flag
--------------------------------------------------
Total_Ct_Chng_Q4_Q1 vs Attrition_Flag
--------------------------------------------------
Avg_Utilization_Ratio vs Attrition_Flag
--------------------------------------------------
Months_on_book vs Attrition_Flag
  1. How is the total transaction amount distributed?
In [ ]:
histogram_boxplot(df, 'Total_Trans_Amt')

Observations¶

  • The distribution of Total_Trans_Amt is right-skewed
  • The boxplot shows that there are outliers at the right end
  1. What is the distribution of the level of education of customers?
In [ ]:
labeled_barplot(df, 'Education_Level')

Observations¶

Most customers have a Graduate or High School education level

  1. What is the distribution of the level of income of customers?
In [ ]:
labeled_barplot(df, 'Income_Category')

Observations¶

Most customers have an annual income below $40K

  1. How does the change in transaction amount between Q4 and Q1 (total_ct_change_Q4_Q1) vary by the customer's account status (Attrition_Flag)?
In [ ]:
distribution_plot_wrt_target(df, "Total_Ct_Chng_Q4_Q1", "Attrition_Flag")
  1. How does the number of months a customer was inactive in the last 12 months (Months_Inactive_12_mon) vary by the customer's account status (Attrition_Flag)?
In [ ]:
distribution_plot_wrt_target(df, 'Months_Inactive_12_mon', "Attrition_Flag")
  1. What are the attributes that have a strong correlation with each other?
In [ ]:
columns = []
for column in df.columns:
    if df[column].dtypes != 'category':
        columns.append(column)

plt.figure(figsize=(15, 7))
sns.heatmap(df[columns].corr(), annot=True, vmin=-1,
            vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
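The strong pairs discussed in the observations can also be extracted programmatically from the correlation matrix. A minimal sketch (the 0.7 threshold and the helper name are our choices, and the toy frame below stands in for the numeric columns of `df`):

```python
import numpy as np
import pandas as pd

def strong_pairs(frame: pd.DataFrame, threshold: float = 0.7) -> pd.Series:
    """Return off-diagonal correlation pairs with |r| above the threshold.

    Expects a frame of numeric columns (e.g. the `df[columns]` used above).
    """
    corr = frame.corr()
    # Keep only the upper triangle so each pair is reported once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs.abs() > threshold].sort_values(ascending=False)

# Toy usage: `b` is an exact multiple of `a`, `c` is unrelated
toy = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 1, 3, 2]})
res = strong_pairs(toy)
print(res)  # only the (a, b) pair survives the threshold
```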

Observations¶

  • Avg_Open_To_Buy and Credit_Limit: These variables have a perfect positive correlation of 1. It indicates that as the Avg_Open_To_Buy increases, the Credit_Limit also increases; the two variables carry essentially redundant information.

  • Total_Trans_Amt and Total_Trans_Ct: These variables have a strong positive correlation of 0.807192. It indicates that as the total transaction amount increases, the total transaction count also tends to increase.

  • Months_on_book and Customer_Age: These variables have a strong positive correlation of 0.788912. It indicates that customers with a higher Months_on_book tend to have a higher Customer_Age.

On the other hand, the following notable negative correlations can be observed:

  • Total_Ct_Chng_Q4_Q1 and Attrition_Flag: These variables have a moderate negative correlation of -0.290054. It suggests that customers who have a higher rate of change in the number of transactions between Q4 and Q1 are less likely to churn (attrit).

  • Credit_Limit and Avg_Utilization_Ratio: These variables have a moderate negative correlation of -0.482965. It indicates that as the credit limit increases, the average utilization ratio tends to decrease.

Data Pre-processing¶

In [ ]:
data = df.copy()
data.isna().sum()
Out[ ]:
Attrition_Flag                 0
Customer_Age                   0
Gender                         0
Dependent_count                0
Education_Level             1519
Marital_Status               749
Income_Category             1112
Card_Category                  0
Months_on_book                 0
Total_Relationship_Count       0
Months_Inactive_12_mon         0
Contacts_Count_12_mon          0
Credit_Limit                   0
Total_Revolving_Bal            0
Avg_Open_To_Buy                0
Total_Amt_Chng_Q4_Q1           0
Total_Trans_Amt                0
Total_Trans_Ct                 0
Total_Ct_Chng_Q4_Q1            0
Avg_Utilization_Ratio          0
dtype: int64
In [ ]:
x = df.drop(["Attrition_Flag"], axis=1)
y = df["Attrition_Flag"]

# Splitting data into training, validation and test sets:
# first we split data into 2 parts, say temporary and test

x_temp, x_test, y_temp, y_test = train_test_split(
    x, y, test_size=0.2, random_state=1, stratify=y
)

# then we split the temporary set into train and validation

x_train, x_val, y_train, y_val = train_test_split(
    x_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)
print(x_train.shape, x_val.shape, x_test.shape)
(6075, 19) (2026, 19) (2026, 19)
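As a sanity check, the stratified split should preserve the churn rate in every partition. A small sketch on toy data (the shapes and the 16% positive rate are illustrative, chosen to resemble this dataset):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

x = pd.DataFrame({"feature": range(1000)})
y = pd.Series([0] * 840 + [1] * 160)  # ~16% positive class, close to the churn rate here

# Same 60/20/20 scheme as above: 20% test, then 25% of the remainder as validation
x_temp, x_test, y_temp, y_test = train_test_split(
    x, y, test_size=0.2, random_state=1, stratify=y
)
x_train, x_val, y_train, y_val = train_test_split(
    x_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)

# Stratification keeps the positive rate (~0.16) identical across the splits
print(y_train.mean(), y_val.mean(), y_test.mean())
```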

Missing value imputation¶

Imputing missing values in the categorical columns using the mode learned from the training data

In [ ]:
# Let's impute the missing values
imp_mode = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
cols_to_impute = ["Education_Level", "Marital_Status", "Income_Category"]

# fit and transform the imputer on train data
x_train[cols_to_impute] = imp_mode.fit_transform(x_train[cols_to_impute])

# Transform on validation and test data
x_val[cols_to_impute] = imp_mode.transform(x_val[cols_to_impute])

# Transform on test data (never fit on it, to avoid data leakage)
x_test[cols_to_impute] = imp_mode.transform(x_test[cols_to_impute])

# Creating dummy variables for categorical variables
x_train = pd.get_dummies(data=x_train, drop_first=True)
x_val = pd.get_dummies(data=x_val, drop_first=True)
x_test = pd.get_dummies(data=x_test, drop_first=True)
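One caveat with calling pd.get_dummies on each split separately: if a rare category (say, Platinum cards) happens to be absent from the validation or test split, the resulting columns will not line up with the training columns. A defensive sketch on toy data (column names are illustrative):

```python
import pandas as pd

train = pd.DataFrame({"card": ["Blue", "Gold", "Platinum"]})
val = pd.DataFrame({"card": ["Blue", "Gold"]})  # no Platinum rows in this split

train_d = pd.get_dummies(train, drop_first=True)
val_d = pd.get_dummies(val, drop_first=True)  # missing the card_Platinum column

# Align the validation columns to the training columns, filling gaps with 0
val_d = val_d.reindex(columns=train_d.columns, fill_value=0)
print(list(val_d.columns))  # now matches the training columns
```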
In [ ]:
for x in [x_train, x_val, x_test]:
    print('-'*50)
    print(x.isnull().sum())
--------------------------------------------------
Customer_Age                      0
Dependent_count                   0
Months_on_book                    0
Total_Relationship_Count          0
Months_Inactive_12_mon            0
Contacts_Count_12_mon             0
Credit_Limit                      0
Total_Revolving_Bal               0
Avg_Open_To_Buy                   0
Total_Amt_Chng_Q4_Q1              0
Total_Trans_Amt                   0
Total_Trans_Ct                    0
Total_Ct_Chng_Q4_Q1               0
Avg_Utilization_Ratio             0
Gender_M                          0
Education_Level_Doctorate         0
Education_Level_Graduate          0
Education_Level_High School       0
Education_Level_Post-Graduate     0
Education_Level_Uneducated        0
Marital_Status_Married            0
Marital_Status_Single             0
Income_Category_$40K - $60K       0
Income_Category_$60K - $80K       0
Income_Category_$80K - $120K      0
Income_Category_Less than $40K    0
Card_Category_Gold                0
Card_Category_Platinum            0
Card_Category_Silver              0
dtype: int64
--------------------------------------------------
Customer_Age                      0
Dependent_count                   0
Months_on_book                    0
Total_Relationship_Count          0
Months_Inactive_12_mon            0
Contacts_Count_12_mon             0
Credit_Limit                      0
Total_Revolving_Bal               0
Avg_Open_To_Buy                   0
Total_Amt_Chng_Q4_Q1              0
Total_Trans_Amt                   0
Total_Trans_Ct                    0
Total_Ct_Chng_Q4_Q1               0
Avg_Utilization_Ratio             0
Gender_M                          0
Education_Level_Doctorate         0
Education_Level_Graduate          0
Education_Level_High School       0
Education_Level_Post-Graduate     0
Education_Level_Uneducated        0
Marital_Status_Married            0
Marital_Status_Single             0
Income_Category_$40K - $60K       0
Income_Category_$60K - $80K       0
Income_Category_$80K - $120K      0
Income_Category_Less than $40K    0
Card_Category_Gold                0
Card_Category_Platinum            0
Card_Category_Silver              0
dtype: int64
--------------------------------------------------
Customer_Age                      0
Dependent_count                   0
Months_on_book                    0
Total_Relationship_Count          0
Months_Inactive_12_mon            0
Contacts_Count_12_mon             0
Credit_Limit                      0
Total_Revolving_Bal               0
Avg_Open_To_Buy                   0
Total_Amt_Chng_Q4_Q1              0
Total_Trans_Amt                   0
Total_Trans_Ct                    0
Total_Ct_Chng_Q4_Q1               0
Avg_Utilization_Ratio             0
Gender_M                          0
Education_Level_Doctorate         0
Education_Level_Graduate          0
Education_Level_High School       0
Education_Level_Post-Graduate     0
Education_Level_Uneducated        0
Marital_Status_Married            0
Marital_Status_Single             0
Income_Category_$40K - $60K       0
Income_Category_$60K - $80K       0
Income_Category_$80K - $120K      0
Income_Category_Less than $40K    0
Card_Category_Gold                0
Card_Category_Platinum            0
Card_Category_Silver              0
dtype: int64

Model Building¶

Model evaluation criterion¶

The nature of predictions made by the classification model will translate as follows:

  • True positives (TP) are attriting customers correctly identified by the model.
  • False negatives (FN) are customers who actually attrit but whom the model predicts will stay.
  • False positives (FP) are customers flagged as likely to attrit who actually stay.

Which metric to optimize?

  • We need to choose the metric which will ensure that the maximum number of attriting customers are identified correctly by the model.
  • We would want Recall to be maximized, as the greater the Recall, the higher the chances of minimizing false negatives.
  • We want to minimize false negatives because if the model predicts that a customer will stay when they are actually about to leave, the bank loses that customer's fee income without any opportunity to intervene.

Let's define a function to output different metrics (including recall) on the train and validation sets, and a function to show the confusion matrix, so that we do not have to repeat the same code while evaluating models.

In [ ]:
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "Accuracy": acc,
            "Recall": recall,
            "Precision": precision,
            "F1": f1

        },
        index=[0],
    )

    return df_perf
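The cell above defines the metrics helper; the confusion-matrix companion it mentions can be sketched along the same lines. A minimal version returning a labeled percentage table (the function name and layout are our choices, not necessarily the original notebook's):

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.metrics import confusion_matrix

def confusion_matrix_table(model, predictors, target):
    """Return the confusion matrix as a labeled table of percentages."""
    pred = model.predict(predictors)
    cm = confusion_matrix(target, pred)
    return pd.DataFrame(
        cm * 100.0 / cm.sum(),
        index=["Actual 0", "Actual 1"],
        columns=["Predicted 0", "Predicted 1"],
    )

# Toy usage: a majority-class baseline on an imbalanced target
X = np.arange(10).reshape(-1, 1)
y = np.array([0] * 8 + [1] * 2)
clf = DummyClassifier(strategy="most_frequent").fit(X, y)
tbl = confusion_matrix_table(clf, X, y)
print(tbl)  # all predictions land in the "Predicted 0" column
```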

Model Building with original data¶

In [ ]:
Model_List = []
Model_List.append(("Bagging", BaggingClassifier, {'random_state': 1}))
Model_List.append(
    ("Random forest", RandomForestClassifier, {'random_state': 1}))
Model_List.append(("GBM", GradientBoostingClassifier, {'random_state': 1}))
Model_List.append(("Adaboost", AdaBoostClassifier, {'random_state': 1}))
Model_List.append(("Xgboost", XGBClassifier, {
                  'random_state': 1, 'eval_metric': "logloss"}))
Model_List.append(("dtree", DecisionTreeClassifier, {'random_state': 1}))


Models = []  # Empty list to store all the models

# Appending models into the list
for name, model, params in Model_List:
    Models.append((name, model(**params)))


for name, model in Models:
    print('-'*50)
    print(name)
    model.fit(x_train, y_train)
    print("\nTraining performance:")
    train = model_performance_classification_sklearn(
        model, x_train, y_train
    )
    print(train)
    print("Validation performance:")
    validation = model_performance_classification_sklearn(
        model, x_val, y_val
    )
    print(validation)
--------------------------------------------------
Bagging

Training performance:
   Accuracy    Recall  Precision        F1
0  0.997202  0.985656   0.996891  0.991242
Validation performance:
   Accuracy    Recall  Precision       F1
0  0.956071  0.812883   0.904437  0.85622
--------------------------------------------------
Random forest

Training performance:
   Accuracy  Recall  Precision   F1
0       1.0     1.0        1.0  1.0
Validation performance:
   Accuracy    Recall  Precision        F1
0  0.956565  0.797546   0.921986  0.855263
--------------------------------------------------
GBM

Training performance:
   Accuracy  Recall  Precision        F1
0   0.97284   0.875   0.952062  0.911906
Validation performance:
   Accuracy    Recall  Precision        F1
0  0.967423  0.855828   0.936242  0.894231
--------------------------------------------------
Adaboost

Training performance:
   Accuracy    Recall  Precision        F1
0  0.957366  0.826844   0.899666  0.861719
Validation performance:
   Accuracy    Recall  Precision        F1
0  0.961994  0.852761   0.905537  0.878357
--------------------------------------------------
Xgboost

Training performance:
   Accuracy  Recall  Precision   F1
0       1.0     1.0        1.0  1.0
Validation performance:
   Accuracy    Recall  Precision        F1
0  0.969398  0.883436   0.923077  0.902821
--------------------------------------------------
dtree

Training performance:
   Accuracy  Recall  Precision   F1
0       1.0     1.0        1.0  1.0
Validation performance:
   Accuracy    Recall  Precision        F1
0  0.938796  0.815951   0.806061  0.810976

Observations¶

Bagging:

  • The training performance of the Bagging model shows high accuracy, recall, precision, and F1 score, indicating that the model performs well on the training data.
  • The validation performance of the Bagging model also demonstrates good metrics but slightly lower than the training performance. It still achieves high accuracy, recall, precision, and F1 score, suggesting generalization to unseen data.

Random Forest:

  • The training performance of the Random Forest model shows perfect accuracy, recall, precision, and F1 score, indicating that the model perfectly fits the training data. This might suggest potential overfitting.
  • The validation performance of the Random Forest model remains high on accuracy and precision, but recall drops from 1.0 to about 0.80, so the model misses a fair share of attriting customers; the gap from the perfect training scores points to some overfitting.

GBM (Gradient Boosting Machine):

  • The training performance of the GBM model shows high accuracy, recall, precision, and F1 score, indicating good performance on the training data.
  • The validation performance of the GBM model also demonstrates high accuracy, recall, precision, and F1 score, suggesting good generalization ability.

Adaboost:

  • The training performance of the Adaboost model shows high accuracy, recall, precision, and F1 score, indicating good performance on the training data.
  • The validation performance of the Adaboost model also demonstrates high accuracy, recall, precision, and F1 score, suggesting good generalization capability.

Xgboost:

  • The training performance of the Xgboost model shows perfect accuracy, recall, precision, and F1 score, indicating that the model perfectly fits the training data. This might suggest potential overfitting.
  • The validation performance of the Xgboost model demonstrates high accuracy, recall, precision, and F1 score, indicating good generalization ability.

Decision Tree (dtree):

  • The training performance of the Decision Tree model shows perfect accuracy, recall, precision, and F1 score, indicating that the model perfectly fits the training data. This might suggest potential overfitting.
  • The validation performance of the Decision Tree model is noticeably lower than its perfect training performance across all metrics, confirming the overfitting suggested above; it is the weakest generalizer among the models on the original data.

Model Building with Oversampled data¶

In [ ]:
# Synthetic Minority Over Sampling Technique
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
x_train_over, y_train_over = sm.fit_resample(x_train, y_train)

for name, model in Models:
    print('-'*50)
    print(name)
    model.fit(x_train_over, y_train_over)
    print("\nTraining performance:")
    train = model_performance_classification_sklearn(
        model, x_train_over, y_train_over
    )
    print(train)
    print("Validation performance:")
    validation = model_performance_classification_sklearn(
        model, x_val, y_val
    )
    print(validation)
--------------------------------------------------
Bagging

Training performance:
   Accuracy    Recall  Precision        F1
0  0.998333  0.997647   0.999018  0.998332
Validation performance:
   Accuracy    Recall  Precision        F1
0  0.946693  0.861963    0.81686  0.838806
--------------------------------------------------
Random forest

Training performance:
   Accuracy  Recall  Precision   F1
0       1.0     1.0        1.0  1.0
Validation performance:
   Accuracy    Recall  Precision        F1
0  0.953603  0.861963   0.851515  0.856707
--------------------------------------------------
GBM

Training performance:
   Accuracy    Recall  Precision        F1
0  0.975583  0.979212   0.972157  0.975672
Validation performance:
   Accuracy    Recall  Precision        F1
0  0.957058  0.904908   0.840456  0.871492
--------------------------------------------------
Adaboost

Training performance:
   Accuracy    Recall  Precision        F1
0   0.96009  0.964699   0.955888  0.960273
Validation performance:
   Accuracy   Recall  Precision        F1
0  0.945706  0.90184   0.790323  0.842407
--------------------------------------------------
Xgboost

Training performance:
   Accuracy  Recall  Precision   F1
0       1.0     1.0        1.0  1.0
Validation performance:
   Accuracy   Recall  Precision        F1
0  0.971866  0.92638   0.901493  0.913767
--------------------------------------------------
dtree

Training performance:
   Accuracy  Recall  Precision   F1
0       1.0     1.0        1.0  1.0
Validation performance:
   Accuracy    Recall  Precision        F1
0  0.934353  0.865031   0.760108  0.809182

Observations¶

Bagging:

  • The training performance of the Bagging model shows high accuracy, recall, precision, and F1 score, indicating that the model performs well on the training data.
  • The validation performance of the Bagging model demonstrates good accuracy, recall, precision, and F1 score, although slightly lower than the training performance. This suggests reasonable generalization to unseen data.

Random Forest:

  • The training performance of the Random Forest model shows perfect accuracy, recall, precision, and F1 score, indicating that the model perfectly fits the training data. This might suggest potential overfitting.
  • The validation performance of the Random Forest model demonstrates high accuracy, recall, precision, and F1 score, suggesting good generalization capability of the model.

GBM (Gradient Boosting Machine):

  • The training performance of the GBM model shows high accuracy, recall, precision, and F1 score, indicating good performance on the training data.
  • The validation performance of the GBM model also demonstrates high accuracy, recall, and F1 score. However, the precision is slightly lower, suggesting some difficulty in correctly identifying positive cases.

Adaboost:

  • The training performance of the Adaboost model shows high accuracy, recall, precision, and F1 score, indicating good performance on the training data.
  • The validation performance of the Adaboost model demonstrates decent accuracy, recall, and F1 score. However, the precision is lower, indicating some challenges in accurately predicting positive cases.

Xgboost:

  • The training performance of the Xgboost model shows perfect accuracy, recall, precision, and F1 score, indicating that the model perfectly fits the training data. This might suggest potential overfitting.
  • The validation performance of the Xgboost model demonstrates high accuracy, recall, precision, and F1 score, indicating good generalization ability.

Decision Tree (dtree):

  • The training performance of the Decision Tree model shows perfect accuracy, recall, precision, and F1 score, indicating that the model perfectly fits the training data. This might suggest potential overfitting.
  • The validation performance of the Decision Tree model demonstrates decent accuracy, recall, and F1 score. However, the precision is lower, suggesting some difficulty in correctly predicting positive cases.

Model Building with Undersampled data¶

In [ ]:
# Random undersampler for under sampling the data
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
x_train_un, y_train_un = rus.fit_resample(x_train, y_train)

for name, model in Models:
    print('-'*50)
    print(name)
    model.fit(x_train_un, y_train_un)
    print("\nTraining performance:")
    train = model_performance_classification_sklearn(
        model, x_train_un, y_train_un
    )
    print(train)
    print("Validation performance:")
    validation = model_performance_classification_sklearn(
        model, x_val, y_val
    )
    print(validation)
--------------------------------------------------
Bagging

Training performance:
   Accuracy    Recall  Precision        F1
0  0.995389  0.990779        1.0  0.995368
Validation performance:
   Accuracy    Recall  Precision        F1
0  0.924975  0.929448   0.701389  0.799472
--------------------------------------------------
Random forest

Training performance:
   Accuracy  Recall  Precision   F1
0       1.0     1.0        1.0  1.0
Validation performance:
   Accuracy   Recall  Precision        F1
0   0.93386  0.93865   0.728571  0.820375
--------------------------------------------------
GBM

Training performance:
   Accuracy    Recall  Precision        F1
0  0.974385  0.980533   0.968623  0.974542
Validation performance:
   Accuracy    Recall  Precision        F1
0  0.934847  0.957055   0.725581  0.825397
--------------------------------------------------
Adaboost

Training performance:
   Accuracy    Recall  Precision        F1
0  0.949795  0.952869   0.947047  0.949949
Validation performance:
   Accuracy    Recall  Precision        F1
0  0.928924  0.960123   0.704955  0.812987
--------------------------------------------------
Xgboost

Training performance:
   Accuracy  Recall  Precision   F1
0       1.0     1.0        1.0  1.0
Validation performance:
   Accuracy    Recall  Precision        F1
0  0.938796  0.957055   0.739336  0.834225
--------------------------------------------------
dtree

Training performance:
   Accuracy  Recall  Precision   F1
0       1.0     1.0        1.0  1.0
Validation performance:
   Accuracy    Recall  Precision        F1
0  0.894867  0.920245   0.616016  0.738007

Observations¶

Bagging:

  • The training performance of the Bagging model shows high accuracy, recall, precision, and F1 score, indicating that the model performs well on the training data.
  • The validation performance of the Bagging model demonstrates decent accuracy and recall. However, the precision and F1 score are lower, indicating challenges in correctly predicting positive cases.

Random Forest:

  • The training performance of the Random Forest model shows perfect accuracy, recall, precision, and F1 score, indicating that the model perfectly fits the training data. This might suggest potential overfitting.
  • The validation performance of the Random Forest model demonstrates decent accuracy, recall, and F1 score. However, the precision is lower, indicating challenges in correctly predicting positive cases.

GBM (Gradient Boosting Machine):

  • The training performance of the GBM model shows high accuracy, recall, precision, and F1 score, indicating good performance on the training data.
  • The validation performance of the GBM model demonstrates decent accuracy, recall, precision, and F1 score. However, the precision is lower, indicating challenges in correctly predicting positive cases.

Adaboost:

  • The training performance of the Adaboost model shows high accuracy, recall, precision, and F1 score, indicating good performance on the training data.
  • The validation performance of the Adaboost model demonstrates decent accuracy, recall, and F1 score. However, the precision is lower, indicating challenges in correctly predicting positive cases.

Xgboost:

  • The training performance of the Xgboost model shows perfect accuracy, recall, precision, and F1 score, indicating that the model perfectly fits the training data. This might suggest potential overfitting.
  • The validation performance of the Xgboost model demonstrates decent accuracy, recall, and F1 score. However, the precision is lower, indicating challenges in correctly predicting positive cases.

Decision Tree (dtree):

  • The training performance of the Decision Tree model shows perfect accuracy, recall, precision, and F1 score, indicating that the model perfectly fits the training data. This might suggest potential overfitting.
  • The validation performance of the Decision Tree model demonstrates lower accuracy, recall, precision, and F1 score compared to other models, indicating challenges in generalization to unseen data.

Hyperparameter Tuning¶

Sample Parameter Grids¶

Hyperparameter tuning can take a long time to run, so to keep runtimes manageable you can use the following grids wherever required.

  • For Gradient Boosting:
param_grid = {
    "init": [AdaBoostClassifier(random_state=1),DecisionTreeClassifier(random_state=1)],
    "n_estimators": np.arange(75,150,25),
    "learning_rate": [0.1, 0.01, 0.2, 0.05, 1],
    "subsample":[0.5,0.7,1],
    "max_features":[0.5,0.7,1],
}
  • For Adaboost:
param_grid = {
     "n_estimators": np.arange(10, 110, 10),
     "learning_rate": [0.1, 0.01, 0.2, 0.05, 1],
     "base_estimator": [
         DecisionTreeClassifier(max_depth=1, random_state=1),
         DecisionTreeClassifier(max_depth=2, random_state=1),
         DecisionTreeClassifier(max_depth=3, random_state=1),
    ]
}
  • For Bagging Classifier:
param_grid = {
    'max_samples': [0.8,0.9,1],
    'max_features': [0.7,0.8,0.9],
    'n_estimators' : [30,50,70],
}
  • For Random Forest:
param_grid = {
    "n_estimators": [200,250,300],
    "min_samples_leaf": np.arange(1, 4),
    "max_features": [np.arange(0.3, 0.6, 0.1),'sqrt'],
    "max_samples": np.arange(0.4, 0.7, 0.1)
}
  • For Decision Trees:
param_grid = {
    'max_depth': np.arange(2,6),
    'min_samples_leaf': [1, 4, 7],
    'max_leaf_nodes' : [10, 15],
    'min_impurity_decrease': [0.0001,0.001]
}
  • For XGBoost:
param_grid={
   'n_estimators':np.arange(50,300,50),
   'scale_pos_weight':[0,1,2,5,10],
   'learning_rate':[0.01,0.1,0.2,0.05],
   'gamma':[0,1,3,5],
   'subsample':[0.7,0.8,0.9,1]
}

Tuning all models using original data¶

In [ ]:
Params = []  # Empty list to store all the params

# Appending params into the list
Params.append(
    {
        "name": "Bagging",
        "params":
            {
                'max_samples': [0.8, 0.9, 1],
                'max_features': [0.7, 0.8, 0.9],
                'n_estimators': [30, 50, 70],
            }
    }
)
Params.append(
    {
        "name": "Random forest",
        "params":
            {
                "n_estimators": [200, 250, 300],
                "min_samples_leaf": np.arange(1, 4),
                "max_features": [np.arange(0.3, 0.6, 0.1), 'sqrt'],
                "max_samples": np.arange(0.4, 0.7, 0.1)
            }
    }
)
Params.append(
    {
        "name": "GBM",
        "params":
            {
                "init": [AdaBoostClassifier(random_state=1), DecisionTreeClassifier(random_state=1)],
                "n_estimators": np.arange(75, 150, 25),
                "learning_rate": [0.1, 0.01, 0.2, 0.05, 1],
                "subsample": [0.5, 0.7, 1],
                "max_features": [0.5, 0.7, 1],
            }
    }
)
Params.append(
    {
        "name": "Adaboost",
        "params":
            {
                "n_estimators": np.arange(10, 110, 10),
                "learning_rate": [0.1, 0.01, 0.2, 0.05, 1],
                "base_estimator": [
                    DecisionTreeClassifier(max_depth=1, random_state=1),
                    DecisionTreeClassifier(max_depth=2, random_state=1),
                    DecisionTreeClassifier(max_depth=3, random_state=1),
                ]
            }
    }
)
Params.append(
    {
        "name": "Xgboost",
        "params":
            {
                'n_estimators': np.arange(50, 300, 50),
                'scale_pos_weight': [0, 1, 2, 5, 10],
                'learning_rate': [0.01, 0.1, 0.2, 0.05],
                'gamma': [0, 1, 3, 5],
                'subsample': [0.7, 0.8, 0.9, 1]
            }
    }
)
Params.append(
    {
        "name": "dtree",
        "params":
            {
                'max_depth': np.arange(2, 6),
                'min_samples_leaf': [1, 4, 7],
                'max_leaf_nodes': [10, 15],
                'min_impurity_decrease': [0.0001, 0.001]
            }
    }
)

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

Best_Params = []

# Defining a tuning function 
def tuning(x, y, description):
    for name, model in Models:
        print('-'*50)
        print(name, '-', description)
        # Selecting the right params
        New_Params = [param for param in Params if param['name']
                      == name][0]['params']
        # Calling RandomizedSearchCV
        randomized_cv = RandomizedSearchCV(
            estimator=model,
            param_distributions=New_Params,
            n_iter=10,
            n_jobs=-1,
            scoring=scorer,
            cv=5,
            random_state=1
        )
        # Fitting parameters in RandomizedSearchCV
        randomized_cv.fit(x, y)
        # Saving the best parameters
        Best_Params.append(
            {
                'name': name,
                'description': description,
                'params': randomized_cv.best_params_,
                'score': randomized_cv.best_score_
            }
        )
In [ ]:
# Hyperparameter Tuning for every model
tuning(x_train, y_train, "Original data")
tuning(x_train_over, y_train_over, "Oversampled data")
tuning(x_train_un, y_train_un, "Undersampled data")
In [ ]:
# Printing best Hyperparameters 
for param in Best_Params:
    print('-'*50)
    print(param['name'], '-', param['description'])
    print("Best parameters are {} with CV score={}:" .format(
        param['params'], param['score']))
--------------------------------------------------
Bagging - Original data
Best parameters are {'n_estimators': 70, 'max_samples': 0.8, 'max_features': 0.9} with CV score=0.8268079539508111:
--------------------------------------------------
Random forest - Original data
Best parameters are {'n_estimators': 250, 'min_samples_leaf': 1, 'max_samples': 0.6, 'max_features': 'sqrt'} with CV score=0.7438147566718996:
--------------------------------------------------
GBM - Original data
Best parameters are {'subsample': 0.7, 'n_estimators': 125, 'max_features': 0.7, 'learning_rate': 0.2, 'init': AdaBoostClassifier(random_state=1)} with CV score=0.849361590790162:
--------------------------------------------------
Adaboost - Original data
Best parameters are {'n_estimators': 90, 'learning_rate': 1, 'base_estimator': DecisionTreeClassifier(max_depth=2, random_state=1)} with CV score=0.8637205651491365:
--------------------------------------------------
Xgboost - Original data
Best parameters are {'subsample': 0.7, 'scale_pos_weight': 10, 'n_estimators': 250, 'learning_rate': 0.2, 'gamma': 3} with CV score=0.9108267922553637:
--------------------------------------------------
dtree - Original data
Best parameters are {'min_samples_leaf': 7, 'min_impurity_decrease': 0.0001, 'max_leaf_nodes': 15, 'max_depth': 5} with CV score=0.751941391941392:
--------------------------------------------------
Bagging - Oversampled data
Best parameters are {'n_estimators': 50, 'max_samples': 0.9, 'max_features': 0.7} with CV score=0.9805870807596836:
--------------------------------------------------
Random forest - Oversampled data
Best parameters are {'n_estimators': 300, 'min_samples_leaf': 1, 'max_samples': 0.6, 'max_features': 'sqrt'} with CV score=0.9750974619484692:
--------------------------------------------------
GBM - Oversampled data
Best parameters are {'subsample': 0.5, 'n_estimators': 100, 'max_features': 0.7, 'learning_rate': 0.1, 'init': AdaBoostClassifier(random_state=1)} with CV score=0.957253362581539:
--------------------------------------------------
Adaboost - Oversampled data
Best parameters are {'n_estimators': 90, 'learning_rate': 1, 'base_estimator': DecisionTreeClassifier(max_depth=2, random_state=1)} with CV score=0.944115722834767:
--------------------------------------------------
Xgboost - Oversampled data
Best parameters are {'subsample': 1, 'scale_pos_weight': 5, 'n_estimators': 50, 'learning_rate': 0.05, 'gamma': 1} with CV score=0.9868623602532279:
--------------------------------------------------
dtree - Oversampled data
Best parameters are {'min_samples_leaf': 1, 'min_impurity_decrease': 0.001, 'max_leaf_nodes': 15, 'max_depth': 4} with CV score=0.9048877215262945:
--------------------------------------------------
Bagging - Undersampled data
Best parameters are {'n_estimators': 70, 'max_samples': 1, 'max_features': 0.8} with CV score=1.0:
--------------------------------------------------
Random forest - Undersampled data
Best parameters are {'n_estimators': 300, 'min_samples_leaf': 1, 'max_samples': 0.6, 'max_features': 'sqrt'} with CV score=0.930319204604919:
--------------------------------------------------
GBM - Undersampled data
Best parameters are {'subsample': 0.7, 'n_estimators': 125, 'max_features': 0.7, 'learning_rate': 0.2, 'init': AdaBoostClassifier(random_state=1)} with CV score=0.9559340659340659:
--------------------------------------------------
Adaboost - Undersampled data
Best parameters are {'n_estimators': 70, 'learning_rate': 0.1, 'base_estimator': DecisionTreeClassifier(max_depth=2, random_state=1)} with CV score=0.9436630036630037:
--------------------------------------------------
Xgboost - Undersampled data
Best parameters are {'subsample': 0.7, 'scale_pos_weight': 10, 'n_estimators': 250, 'learning_rate': 0.2, 'gamma': 3} with CV score=0.9764364207221352:
--------------------------------------------------
dtree - Undersampled data
Best parameters are {'min_samples_leaf': 7, 'min_impurity_decrease': 0.0001, 'max_leaf_nodes': 15, 'max_depth': 5} with CV score=0.8934432234432235:
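The `tuning` function and `Model_List` used above are defined earlier in the notebook. As a reference, a minimal sketch of how such a routine can be built with `RandomizedSearchCV` — the model list and parameter grid below are illustrative assumptions, not the notebook's exact values:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

Best_Params = []
# Illustrative model list; the notebook's Model_List holds six (name, class, grid) triples
Model_List = [
    ("Random forest", RandomForestClassifier,
     {"n_estimators": [50, 100, 150], "max_features": ["sqrt", 0.7]}),
]

def tuning(x, y, description):
    for name, model, params in Model_List:
        randomized_cv = RandomizedSearchCV(
            estimator=model(random_state=1),
            param_distributions=params,
            n_iter=4,
            scoring="recall",  # attrited customers are the costly class
            cv=3,
            random_state=1,
        )
        randomized_cv.fit(x, y)
        # Store the winning configuration per (model, dataset) pair
        Best_Params.append(
            {
                "name": name,
                "description": description,
                "params": randomized_cv.best_params_,
                "score": randomized_cv.best_score_,
            }
        )

x_demo, y_demo = make_classification(n_samples=200, weights=[0.8], random_state=1)
tuning(x_demo, y_demo, "Demo data")
print(Best_Params[0]["name"], Best_Params[0]["params"])
```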
In [ ]:
Results = []


def Tuned_Models(x, y, description):
    print(description, "\n")
    for name, model, params in Model_List:
        # Selecting Hyperparameter params for every model 
        New_Params = [param for param in Best_Params if param['name']
                      == name and param['description'] == description][0]['params']
        print('-'*50)
        print(name)
        print('Parameters:', New_Params)
        # Model
        Model_Tuned = model(**New_Params)
        Model_Tuned.fit(x, y)
        print("\nTraining performance:")
        # Checking model's performance on training set
        train = model_performance_classification_sklearn(
            Model_Tuned, x_train, y_train
        )
        print(train)
        print("Validation performance:")
        # Checking model's performance on validation set
        validation = model_performance_classification_sklearn(
            Model_Tuned, x_val, y_val
        )
        print(validation)
        # Saving results
        Results.append(
            {
                "name": name,
                "description": description,
                "training": train,
                "validation": validation,
                "model": Model_Tuned,
                "params": New_Params
            }
        )
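The helper `model_performance_classification_sklearn` is defined earlier in the notebook; judging from the printed output, it returns a one-row DataFrame of four metrics. A minimal re-creation consistent with that output (a sketch, not the original code):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def model_performance_classification_sklearn(model, predictors, target):
    """Return a one-row DataFrame of Accuracy, Recall, Precision, F1."""
    pred = model.predict(predictors)
    return pd.DataFrame(
        {
            "Accuracy": accuracy_score(target, pred),
            "Recall": recall_score(target, pred),
            "Precision": precision_score(target, pred),
            "F1": f1_score(target, pred),
        },
        index=[0],
    )

# Quick demo on synthetic data
X, y = make_classification(n_samples=200, random_state=1)
clf = LogisticRegression().fit(X, y)
perf = model_performance_classification_sklearn(clf, X, y)
print(perf)
```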

Tuned models with original data¶

In [ ]:
Tuned_Models(x_train, y_train, "Original data")
Original data 

--------------------------------------------------
Bagging
Parameters: {'n_estimators': 70, 'max_samples': 0.8, 'max_features': 0.9}

Training performance:
   Accuracy    Recall  Precision        F1
0  0.999012  0.995902   0.997947  0.996923
Validation performance:
   Accuracy    Recall  Precision        F1
0  0.963475  0.855828   0.911765  0.882911
--------------------------------------------------
Random forest
Parameters: {'n_estimators': 250, 'min_samples_leaf': 1, 'max_samples': 0.6, 'max_features': 'sqrt'}

Training performance:
   Accuracy    Recall  Precision        F1
0  0.997695  0.985656        1.0  0.992776
Validation performance:
   Accuracy    Recall  Precision    F1
0  0.955577  0.782209   0.930657  0.85
--------------------------------------------------
GBM
Parameters: {'subsample': 0.7, 'n_estimators': 125, 'max_features': 0.7, 'learning_rate': 0.2, 'init': AdaBoostClassifier(random_state=1)}

Training performance:
   Accuracy    Recall  Precision        F1
0  0.989794  0.956967   0.979036  0.967876
Validation performance:
   Accuracy    Recall  Precision        F1
0  0.974334  0.895706   0.941935  0.918239
--------------------------------------------------
Adaboost
Parameters: {'n_estimators': 90, 'learning_rate': 1, 'base_estimator': DecisionTreeClassifier(max_depth=2, random_state=1)}

Training performance:
   Accuracy    Recall  Precision        F1
0  0.995062  0.979508   0.989648  0.984552
Validation performance:
   Accuracy    Recall  Precision        F1
0   0.96693  0.868098   0.921824  0.894155
--------------------------------------------------
Xgboost
Parameters: {'subsample': 0.7, 'scale_pos_weight': 10, 'n_estimators': 250, 'learning_rate': 0.2, 'gamma': 3}

Training performance:
   Accuracy  Recall  Precision        F1
0  0.996379     1.0   0.977956  0.988855
Validation performance:
   Accuracy    Recall  Precision        F1
0  0.971372  0.957055   0.876404  0.914956
--------------------------------------------------
dtree
Parameters: {'min_samples_leaf': 7, 'min_impurity_decrease': 0.0001, 'max_leaf_nodes': 15, 'max_depth': 5}

Training performance:
   Accuracy    Recall  Precision        F1
0  0.938765  0.805328   0.811983  0.808642
Validation performance:
   Accuracy    Recall  Precision       F1
0  0.930405  0.782209   0.784615  0.78341

Observations¶

Bagging:

  • The training performance of the Bagging model shows high accuracy, recall, precision, and F1 score, indicating that the model performs well on the training data.
  • The validation performance of the Bagging model demonstrates good accuracy, recall, precision, and F1 score, although slightly lower than the training performance. This suggests reasonable generalization to unseen data.
  • The Bagging model is trained with parameters such as 70 estimators, a maximum of 80% of samples, and a maximum of 90% of features.

Random Forest:

  • The training performance of the Random Forest model shows high accuracy, recall, precision, and F1 score, indicating that the model performs well on the training data.
  • The validation performance of the Random Forest model demonstrates decent accuracy, recall, precision, and F1 score. However, the recall is slightly lower, suggesting challenges in correctly identifying positive cases.
  • The Random Forest model is trained with parameters such as 250 estimators, a minimum of 1 sample per leaf, a maximum of 60% of samples, and square root of the number of features.

GBM (Gradient Boosting Machine):

  • The training performance of the GBM model shows high accuracy, recall, precision, and F1 score, indicating good performance on the training data.
  • The validation performance of the GBM model also demonstrates high accuracy, recall, precision, and F1 score, suggesting good generalization ability.
  • The GBM model is trained with parameters such as a subsample of 70%, 125 estimators, a maximum of 70% of features, a learning rate of 0.2, and an AdaBoostClassifier as the initialization.

Adaboost:

  • The training performance of the Adaboost model shows high accuracy, recall, precision, and F1 score, indicating good performance on the training data.
  • The validation performance of the Adaboost model demonstrates decent accuracy, recall, precision, and F1 score. However, the recall is slightly lower, suggesting challenges in correctly identifying positive cases.
  • The Adaboost model is trained with parameters such as 90 estimators, a learning rate of 1, and a DecisionTreeClassifier with a maximum depth of 2 as the base estimator.

Xgboost:

  • The training performance of the Xgboost model shows high accuracy, recall, precision, and F1 score, indicating good performance on the training data.
  • The validation performance of the Xgboost model demonstrates high accuracy, recall, precision, and F1 score, suggesting good generalization ability.
  • The Xgboost model is trained with parameters such as a subsample of 70%, a scale_pos_weight of 10, 250 estimators, a learning rate of 0.2, and a gamma value of 3.

Decision Tree (dtree):

  • The training performance of the Decision Tree model shows decent accuracy, recall, precision, and F1 score, indicating reasonable performance on the training data.
  • The validation performance of the Decision Tree model also demonstrates decent accuracy, recall, precision, and F1 score. However, the precision and F1 score are slightly lower, suggesting challenges in correctly predicting positive cases.
  • The Decision Tree model is trained with parameters such as a minimum of 7 samples per leaf, a minimum impurity decrease of 0.0001, a maximum of 15 leaf nodes, and a maximum depth of 5.

Tuned models with oversampled data¶

In [ ]:
# Checking model's performance
Tuned_Models(x_train_over, y_train_over, "Oversampled data")
Oversampled data 

--------------------------------------------------
Bagging
Parameters: {'n_estimators': 50, 'max_samples': 0.9, 'max_features': 0.7}

Training performance:
   Accuracy  Recall  Precision   F1
0       1.0     1.0        1.0  1.0
Validation performance:
   Accuracy    Recall  Precision        F1
0  0.959526  0.895706   0.858824  0.876877
--------------------------------------------------
Random forest
Parameters: {'n_estimators': 300, 'min_samples_leaf': 1, 'max_samples': 0.6, 'max_features': 'sqrt'}

Training performance:
   Accuracy  Recall  Precision        F1
0  0.999342     1.0   0.995918  0.997955
Validation performance:
   Accuracy    Recall  Precision        F1
0  0.952122  0.877301   0.833819  0.855007
--------------------------------------------------
GBM
Parameters: {'subsample': 0.5, 'n_estimators': 100, 'max_features': 0.7, 'learning_rate': 0.1, 'init': AdaBoostClassifier(random_state=1)}

Training performance:
   Accuracy    Recall  Precision        F1
0  0.967078  0.936475   0.868821  0.901381
Validation performance:
   Accuracy   Recall  Precision        F1
0    0.9615  0.91411   0.856322  0.884273
--------------------------------------------------
Adaboost
Parameters: {'n_estimators': 90, 'learning_rate': 1, 'base_estimator': DecisionTreeClassifier(max_depth=2, random_state=1)}

Training performance:
   Accuracy    Recall  Precision        F1
0  0.992593  0.984631   0.969728  0.977123
Validation performance:
   Accuracy    Recall  Precision        F1
0  0.964462  0.895706   0.884848  0.890244
--------------------------------------------------
Xgboost
Parameters: {'subsample': 1, 'scale_pos_weight': 5, 'n_estimators': 50, 'learning_rate': 0.05, 'gamma': 1}

Training performance:
   Accuracy    Recall  Precision        F1
0  0.908313  0.996926   0.637197  0.777467
Validation performance:
   Accuracy    Recall  Precision        F1
0  0.896841  0.960123   0.614931  0.749701
--------------------------------------------------
dtree
Parameters: {'min_samples_leaf': 1, 'min_impurity_decrease': 0.001, 'max_leaf_nodes': 15, 'max_depth': 4}

Training performance:
   Accuracy  Recall  Precision        F1
0   0.91786   0.875   0.693745  0.773901
Validation performance:
   Accuracy    Recall  Precision        F1
0  0.914116  0.861963   0.685366  0.763587

Observations¶

Bagging:

  • The training performance of the Bagging model shows perfect accuracy, recall, precision, and F1 score, indicating that the model perfectly fits the oversampled training data.
  • The validation performance of the Bagging model demonstrates good accuracy, recall, precision, and F1 score, although slightly lower than the perfect training performance. This suggests reasonable generalization to unseen data.
  • The Bagging model is trained with parameters such as 50 estimators, a maximum of 90% of samples, and a maximum of 70% of features.

Random Forest:

  • The training performance of the Random Forest model shows high accuracy, recall, precision, and F1 score, indicating that the model performs well on the oversampled training data.
  • The validation performance of the Random Forest model demonstrates decent accuracy, recall, and F1 score. However, the precision is slightly lower than on the original data, meaning a larger share of the customers flagged as attritors are false positives.
  • The Random Forest model is trained with parameters such as 300 estimators, a minimum of 1 sample per leaf, a maximum of 60% of samples, and square root of the number of features.

GBM (Gradient Boosting Machine):

  • The training performance of the GBM model shows high accuracy, recall, precision, and F1 score, indicating good performance on the oversampled training data.
  • The validation performance of the GBM model also demonstrates high accuracy, recall, precision, and F1 score, suggesting good generalization ability.
  • The GBM model is trained with parameters such as a subsample of 50%, 100 estimators, a maximum of 70% of features, a learning rate of 0.1, and an AdaBoostClassifier as the initialization.

Adaboost:

  • The training performance of the Adaboost model shows high accuracy, recall, precision, and F1 score, indicating good performance on the oversampled training data.
  • The validation performance of the Adaboost model demonstrates decent accuracy, recall, and F1 score. However, the precision is slightly lower than on the original data, meaning more existing customers are misflagged as attritors.
  • The Adaboost model is trained with parameters such as 90 estimators, a learning rate of 1, and a DecisionTreeClassifier with a maximum depth of 2 as the base estimator.

Xgboost:

  • The training performance of the Xgboost model shows very high recall but low precision (~0.64), a pattern consistent with its scale_pos_weight setting biasing predictions toward the attrited class.
  • On validation the model keeps recall high (~0.96), but its accuracy, precision, and F1 score are the lowest among the oversampled models, indicating many false positives.
  • The Xgboost model is trained with parameters such as a subsample of 100%, a scale_pos_weight of 5, 50 estimators, a learning rate of 0.05, and a gamma value of 1.

Decision Tree (dtree):

  • The training performance of the Decision Tree model shows decent accuracy, recall, precision, and F1 score, indicating reasonable performance on the oversampled training data.
  • The validation performance of the Decision Tree model also demonstrates decent accuracy, recall, precision, and F1 score. However, the precision and F1 score are slightly lower, suggesting challenges in correctly predicting positive cases.
  • The Decision Tree model is trained with parameters such as a minimum of 1 sample per leaf, a minimum impurity decrease of 0.001, a maximum of 15 leaf nodes, and a maximum depth of 4.

Tuned models with undersampled data¶

In [ ]:
Tuned_Models(x_train_un, y_train_un, "Undersampled data")
Undersampled data 

--------------------------------------------------
Bagging
Parameters: {'n_estimators': 70, 'max_samples': 1, 'max_features': 0.8}

Training performance:
   Accuracy  Recall  Precision       F1
0  0.160658     1.0   0.160658  0.27684
Validation performance:
   Accuracy  Recall  Precision        F1
0  0.160908     1.0   0.160908  0.277211
--------------------------------------------------
Random forest
Parameters: {'n_estimators': 300, 'min_samples_leaf': 1, 'max_samples': 0.6, 'max_features': 'sqrt'}

Training performance:
   Accuracy  Recall  Precision        F1
0  0.943374     1.0   0.739394  0.850174
Validation performance:
   Accuracy    Recall  Precision        F1
0   0.92695  0.929448   0.707944  0.803714
--------------------------------------------------
GBM
Parameters: {'subsample': 0.7, 'n_estimators': 125, 'max_features': 0.7, 'learning_rate': 0.2, 'init': AdaBoostClassifier(random_state=1)}

Training performance:
   Accuracy  Recall  Precision        F1
0  0.961646     1.0   0.807279  0.893364
Validation performance:
   Accuracy   Recall  Precision        F1
0  0.946693  0.96319   0.765854  0.853261
--------------------------------------------------
Adaboost
Parameters: {'n_estimators': 70, 'learning_rate': 0.1, 'base_estimator': DecisionTreeClassifier(max_depth=2, random_state=1)}

Training performance:
   Accuracy    Recall  Precision        F1
0  0.930041  0.969262   0.705444  0.816573
Validation performance:
   Accuracy    Recall  Precision       F1
0  0.929911  0.972393   0.704444  0.81701
--------------------------------------------------
Xgboost
Parameters: {'subsample': 0.7, 'scale_pos_weight': 10, 'n_estimators': 250, 'learning_rate': 0.2, 'gamma': 3}

Training performance:
   Accuracy  Recall  Precision        F1
0  0.929547     1.0   0.695157  0.820168
Validation performance:
   Accuracy    Recall  Precision       F1
0  0.909181  0.984663      0.642  0.77724
--------------------------------------------------
dtree
Parameters: {'min_samples_leaf': 7, 'min_impurity_decrease': 0.0001, 'max_leaf_nodes': 15, 'max_depth': 5}

Training performance:
   Accuracy    Recall  Precision        F1
0  0.870453  0.939549   0.557447  0.699733
Validation performance:
   Accuracy    Recall  Precision        F1
0  0.867226  0.920245   0.552486  0.690449

Observations¶

Bagging:

  • On the undersampled training data, the Bagging model reaches perfect recall but an accuracy and precision of only about 0.16, which matches the attrited-customer share of the data: the model is effectively labelling every customer as an attritor.
  • The validation performance mirrors the training performance, so the model provides no real discrimination when trained this way.

Random Forest:

  • The training performance of the Random Forest model shows decent accuracy and F1 score with perfect recall, though precision is modest, indicating reasonable performance on the undersampled training data.
  • On validation, recall stays high (~0.93) but precision drops to ~0.71, so roughly three in ten customers flagged as attritors are false positives.

GBM (Gradient Boosting Machine):

  • The training performance of the GBM model shows high accuracy, recall, precision, and F1 score, indicating good performance on the undersampled training data.
  • The validation performance of the GBM model also demonstrates high accuracy, recall, precision, and F1 score, suggesting good generalization ability.

Adaboost:

  • The training performance of the Adaboost model shows decent accuracy, recall, precision, and F1 score. The model performs reasonably well on the undersampled training data.
  • The validation performance of the Adaboost model demonstrates decent accuracy, recall, and F1 score. However, the precision is noticeably lower (~0.70), so many existing customers are misflagged as attritors.

Xgboost:

  • The training performance of the Xgboost model shows perfect recall with decent accuracy but modest precision, a reasonable fit to the undersampled training data.
  • On validation the model achieves the highest recall of the group (~0.98), but its precision (~0.64) and F1 score are the lowest, trading many false positives for catching nearly every attritor.

Decision Tree (dtree):

  • The training performance of the Decision Tree model shows lower accuracy, recall, precision, and F1 score compared to other models, indicating challenges in capturing patterns in the undersampled training data.
  • The validation performance of the Decision Tree model also demonstrates lower accuracy, recall, precision, and F1 score compared to other models, suggesting challenges in generalization to unseen data.
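The oversampled and undersampled training sets (`x_train_over`, `x_train_un`) are built earlier in the notebook, commonly with SMOTE and random undersampling. A minimal sklearn-only sketch of random over/undersampling with `sklearn.utils.resample` — an illustration of the idea, not the notebook's exact code:

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced frame standing in for the training data (illustrative only)
df = pd.DataFrame({"feat": np.arange(100), "target": [1] * 16 + [0] * 84})
minority = df[df["target"] == 1]
majority = df[df["target"] == 0]

# Random oversampling: duplicate minority rows up to the majority count
over = pd.concat([
    majority,
    resample(minority, replace=True, n_samples=len(majority), random_state=1),
])

# Random undersampling: drop majority rows down to the minority count
under = pd.concat([
    resample(majority, replace=False, n_samples=len(minority), random_state=1),
    minority,
])
print(over["target"].value_counts().to_dict(), under["target"].value_counts().to_dict())
```

Both approaches rebalance the classes; oversampling keeps all majority information at the risk of overfitting duplicated rows, while undersampling discards majority rows and can hurt precision, as seen above.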

Model Comparison and Final Model Selection¶

In [ ]:
for name, model, params in Model_List:
    print('-'*50)
    print(name)
    Model_Results = [result for result in Results if result['name'] == name]
    training = []
    validation = []
    # Grouping by model 
    for result in Model_Results:
        # Grouping training results
        training.append(result['training'].T)
        # Grouping validation results
        validation.append(result['validation'].T)
    models_train = pd.concat(
        training,
        axis=1,
    )
    # Columns names
    columns = ['Original data', 'Oversampled data', 'Undersampled data']
    models_train.columns = columns
    print('\n', 'Training')
    print(models_train)
    models_validation = pd.concat(
        validation,
        axis=1,
    )
    models_validation.columns = columns
    print('\n', 'Validation')
    print(models_validation)
--------------------------------------------------
Bagging

 Training
           Original data  Oversampled data  Undersampled data
Accuracy        0.999012               1.0           0.160658
Recall          0.995902               1.0           1.000000
Precision       0.997947               1.0           0.160658
F1              0.996923               1.0           0.276840

 Validation
           Original data  Oversampled data  Undersampled data
Accuracy        0.963475          0.959526           0.160908
Recall          0.855828          0.895706           1.000000
Precision       0.911765          0.858824           0.160908
F1              0.882911          0.876877           0.277211
--------------------------------------------------
Random forest

 Training
           Original data  Oversampled data  Undersampled data
Accuracy        0.997695          0.999342           0.943374
Recall          0.985656          1.000000           1.000000
Precision       1.000000          0.995918           0.739394
F1              0.992776          0.997955           0.850174

 Validation
           Original data  Oversampled data  Undersampled data
Accuracy        0.955577          0.952122           0.926950
Recall          0.782209          0.877301           0.929448
Precision       0.930657          0.833819           0.707944
F1              0.850000          0.855007           0.803714
--------------------------------------------------
GBM

 Training
           Original data  Oversampled data  Undersampled data
Accuracy        0.989794          0.967078           0.961646
Recall          0.956967          0.936475           1.000000
Precision       0.979036          0.868821           0.807279
F1              0.967876          0.901381           0.893364

 Validation
           Original data  Oversampled data  Undersampled data
Accuracy        0.974334          0.961500           0.946693
Recall          0.895706          0.914110           0.963190
Precision       0.941935          0.856322           0.765854
F1              0.918239          0.884273           0.853261
--------------------------------------------------
Adaboost

 Training
           Original data  Oversampled data  Undersampled data
Accuracy        0.995062          0.992593           0.930041
Recall          0.979508          0.984631           0.969262
Precision       0.989648          0.969728           0.705444
F1              0.984552          0.977123           0.816573

 Validation
           Original data  Oversampled data  Undersampled data
Accuracy        0.966930          0.964462           0.929911
Recall          0.868098          0.895706           0.972393
Precision       0.921824          0.884848           0.704444
F1              0.894155          0.890244           0.817010
--------------------------------------------------
Xgboost

 Training
           Original data  Oversampled data  Undersampled data
Accuracy        0.996379          0.908313           0.929547
Recall          1.000000          0.996926           1.000000
Precision       0.977956          0.637197           0.695157
F1              0.988855          0.777467           0.820168

 Validation
           Original data  Oversampled data  Undersampled data
Accuracy        0.971372          0.896841           0.909181
Recall          0.957055          0.960123           0.984663
Precision       0.876404          0.614931           0.642000
F1              0.914956          0.749701           0.777240
--------------------------------------------------
dtree

 Training
           Original data  Oversampled data  Undersampled data
Accuracy        0.938765          0.917860           0.870453
Recall          0.805328          0.875000           0.939549
Precision       0.811983          0.693745           0.557447
F1              0.808642          0.773901           0.699733

 Validation
           Original data  Oversampled data  Undersampled data
Accuracy        0.930405          0.914116           0.867226
Recall          0.782209          0.861963           0.920245
Precision       0.784615          0.685366           0.552486
F1              0.783410          0.763587           0.690449

Observations¶

Bagging:

  • The Bagging model performs exceptionally well on the original and oversampled datasets, achieving high accuracy, recall, precision, and F1 score.
  • On the undersampled dataset, however, its performance collapses: accuracy, precision, and F1 fall to the positive-class prevalence because the model predicts virtually every customer as an attritor.

Random Forest:

  • The Random Forest model shows strong, consistent accuracy and F1 score on all three datasets.
  • Undersampling raises its recall substantially but at the cost of precision, the usual trade-off for this resampling strategy.

GBM:

  • The GBM model performs well on all three datasets, with high accuracy, recall, precision, and F1 score.
  • However, there is a slight drop in performance on the undersampled dataset compared to the original and oversampled datasets.

Adaboost:

  • The Adaboost model demonstrates good performance on the original and oversampled datasets, achieving high accuracy, recall, precision, and F1 score.
  • Similar to GBM, there is a drop in performance on the undersampled dataset, particularly in terms of precision and F1 score.

Xgboost:

  • The Xgboost model exhibits high performance on the original and oversampled datasets, achieving high accuracy, recall, precision, and F1 score.
  • However, there is a significant drop in performance on the undersampled dataset, especially in terms of precision and F1 score.

Decision Tree (dtree):

  • The Decision Tree model shows relatively lower performance compared to other models on all three datasets, particularly on the undersampled dataset.
  • The model struggles with capturing patterns in the imbalanced datasets, leading to lower accuracy, recall, precision, and F1 score.
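The precision/recall trade-offs observed above can also be managed after training by moving the decision threshold instead of resampling. A brief illustrative sketch on synthetic data (not part of the notebook's pipeline):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the churn problem (illustrative)
X, y = make_classification(n_samples=1000, weights=[0.84], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)
clf = GradientBoostingClassifier(random_state=1).fit(X_tr, y_tr)

# Lowering the threshold can only add positive predictions,
# so recall rises (or stays equal) while precision typically falls
proba = clf.predict_proba(X_te)[:, 1]
scores = {}
for threshold in (0.5, 0.3):
    pred = (proba >= threshold).astype(int)
    scores[threshold] = (recall_score(y_te, pred), precision_score(y_te, pred))
    print(threshold, scores[threshold])
```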

Test set final performance¶

The XGBoost, GBM, and Adaboost models trained on undersampled data generalize well, with similar training and validation performance, so they are evaluated on the held-out test set.

In [ ]:
results_selected_models = []


# Selecting the model
def selected_model(name):
    model = [model for model in Results if model['name']
             == name and model['description'] == 'Undersampled data'][0]['model']
    results = model_performance_classification_sklearn(model, x_test, y_test)
    results_selected_models.append(results.T)

# List of top 3 models
selected_models = ['Xgboost', 'Adaboost', 'GBM']

for model_name in selected_models:
    selected_model(model_name)

# Grouping the results of each model
selected_models_results = pd.concat(
    results_selected_models,
    axis=1
)

selected_models_results.columns = selected_models
print('Final results\n')
print(selected_models_results)
Final results

            Xgboost  Adaboost       GBM
Accuracy   0.905726  0.926456  0.943731
Recall     0.990769  0.972308  0.972308
Precision  0.631373  0.692982  0.750594
F1         0.771257  0.809219  0.847185

Observations¶

Based on the test set results, the GBM (Gradient Boosting Machine) model offers the best overall balance of the three finalists: the highest accuracy, precision, and F1 score, with recall equal to Adaboost's (~0.97). XGBoost catches slightly more attritors (recall ~0.99) but at markedly lower precision, so GBM provides the best trade-off between identifying attriting customers and avoiding false alarms.

Feature Importance¶

In [ ]:
# Selecting the model
model = [model for model in Results if model['name']
         == 'GBM' and model['description'] == 'Undersampled data'][0]['model']

# Features names
feature_names = x_train.columns
# Getting the importances
importances = model.feature_importances_
indices = np.argsort(importances)

# Plot the feature importances
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)),
         importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

Observations¶

  • Total_Trans_Ct and Total_Trans_Amt are the two most important variables, indicating that how often and how much a customer transacts are the strongest signals of attrition
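Impurity- or gain-based importances can overstate high-cardinality features, so permutation importance is a common cross-check. A sketch on synthetic data (not the notebook's variables):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=500, n_features=6, n_informative=2, random_state=1)
model = GradientBoostingClassifier(random_state=1).fit(X, y)

# Shuffle each feature and measure the resulting score drop
result = permutation_importance(model, X, y, n_repeats=5, random_state=1)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature_{i}: {result.importances_mean[i]:.3f} +/- {result.importances_std[i]:.3f}")
```

In practice this would be run on a held-out set with the tuned GBM to confirm that the transaction variables really drive the predictions.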

Business Insights and Conclusions¶

Considering the consistently strong performance of the GBM model across all data sampling techniques and its ability to achieve a good balance between precision and recall, it would be a suitable choice as the final model for predicting customer attrition and improving the bank's services. However, further analysis and evaluation, such as assessing the model's stability and robustness, should be conducted before making a final decision.
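Before deployment, the chosen model would typically be persisted so it can be reloaded for scoring. A minimal sketch using the standard-library pickle module (joblib is a common alternative for sklearn models); the data here is a stand-in for the tuned GBM selected above:

```python
import pickle
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Stand-in for the tuned GBM selected above (illustrative data)
X, y = make_classification(n_samples=200, random_state=1)
final_model = GradientBoostingClassifier(random_state=1).fit(X, y)

# Serialize and reload; a real pipeline would also version the artifact
blob = pickle.dumps(final_model)
restored = pickle.loads(blob)
assert (restored.predict(X) == final_model.predict(X)).all()
print("model round-trips intact")
```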